Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

Risk modeling with electronic health records (EHR) data is challenging due to the lack of direct observations of the disease outcome and the high dimensionality of the predictors. In this paper, we develop a surrogate assisted semi-supervised learning approach, leveraging a small labeled dataset with annotated outcomes and extensive unlabeled data containing outcome surrogates and high-dimensional predictors. We propose to impute the unobserved outcomes via a sparse imputation model based on the outcome surrogates and high-dimensional predictors. We further conduct a one-step bias correction to enable interval estimation for the risk prediction. Our inference procedure is valid even if both the imputation and risk prediction models are misspecified. Our novel way of utilizing unlabeled data enables high-dimensional statistical inference in the challenging setting of a dense risk prediction model. We present an extensive simulation study demonstrating the superiority of our approach over existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.


Introduction
Precise risk prediction is vitally important for successful clinical care. High risk patients can be assigned to more intensive monitoring or intervention to improve outcomes. Traditionally, risk prediction models are developed based on cohort studies or registry data.
Population-based disease registries, while remaining a critical source for epidemiological studies, collect information on a relatively small set of pre-specified variables and hence may limit researchers' ability to develop comprehensive risk prediction models (Warren and Yabroff, 2015). Most clinical care is delivered in healthcare systems (Thompson et al., 2015), and electronic health records (EHR) embedded in healthcare systems accrue rich clinical data on broad patient populations. EHR systems centralize the data collected during routine patient care, including structured elements such as codes of the International Classification of Diseases (ICD), medication prescriptions, and medical procedures, as well as free-text narrative documents such as physician notes and pathology reports that can be processed through natural language processing (NLP) for analysis. EHR data are also often linked with biobanks, which provide additional rich molecular information to assist in developing comprehensive risk prediction models for a broad patient population.
Risk modeling with EHR data, however, is challenging for several reasons. First, precise information on the clinical outcome of interest, Y, is often embedded in free-text notes and requires manual effort to extract accurately. Readily available surrogates of Y, denoted S, such as diagnostic codes or NLP mentions of the outcome, are at best good approximations to the true outcome Y. For example, using EHR data from Mass General Brigham (MGB), we found that the positive predictive value (PPV) was only 0.48 for having at least 1 ICD code of type-2 diabetes mellitus (T2DM) and 0.19 for having at least 1 NLP mention of T2DM. Directly using these EHR proxies as the true disease status to derive risk models may lead to substantial biases. On the other hand, extracting precise disease status requires manual chart review, which is not feasible at a large scale. It is thus of great interest to develop risk prediction models under a semi-supervised learning (SSL) framework using both a large unlabeled dataset of size N containing information on predictors X along with surrogates S, and a small labeled dataset of size n with additional observations on Y curated via chart review.
Additional challenges arise from the high dimensionality of the predictor vector X, and the potential model mis-specifications. Although much progress has been made in high dimensional regression in recent years, there is a paucity of literature on high dimensional inference under the SSL setting. Precise estimation of the high dimensional risk model is even more challenging if the risk model is not sparse. Allowing the risk model to be dense is particularly important when X includes genomic markers since a large number of genetic markers appear to contribute to the risk of complex traits (Frazer et al., 2009).
For example, Vujkovic et al. (2020) recently identified 558 genetic variants as significantly associated with T2DM risk. An additional challenge arises when the fitted risk model is mis-specified, which occurs frequently in practice especially in the high dimensional setting. Model mis-specifications can also lead to the fitted model of Y | X to be dense.
There are limited methods currently available to make inference about high dimensional risk prediction models in the SSL setting especially under a possibly mis-specified dense model. In this paper, we fill in the gap by proposing an efficient surrogate assisted SSL (SAS) prediction procedure that leverages the fully observed surrogates S to make inference about a high dimensional risk model under such settings.
Under the supervised setting where both Y and X are fully observed, much progress has been made in recent years in the area of high-dimensional inference. High-dimensional regression methods have been developed for commonly used generalized linear models (GLM) under sparsity assumptions on the regression parameters (van de Geer and Bühlmann, 2009; Negahban et al., 2010; Huang and Zhang, 2012). Recently, Zhu and Bradic (2018b) studied inference for linear combinations of coefficients under a dense linear model with a sparse precision matrix. Inference procedures have also been developed for both sparse (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014) and dense (Zhu and Bradic, 2018a) combinations of the regression parameters. High-dimensional inference under the logistic regression model has also been studied recently (van de Geer et al., 2014; Ma et al., 2020; Guo et al., 2020).
Under the SSL setting with n ≪ N, however, there is a paucity of literature on high-dimensional inference. Although SSL can be viewed as a missing data problem, it differs from the standard missing data setting in a critical way: under the SSL setting, the missing probability tends to 1, which violates a key assumption required in the missing data literature (e.g., Bang and Robins, 2005; Smucler et al., 2019; Chakrabortty et al., 2019). Existing work on SSL with high-dimensional covariates largely focuses on post-estimation inference for global parameters under sparse linear models, with examples including SSL estimation of the population mean (Zhang et al., 2019; Zhang and Bradic, 2019), the explained variance (Cai and Guo, 2018), and the average treatment effect (Cheng et al., 2018; Kallus and Mao, 2020). To the best of our knowledge, our SAS procedure is the first to conduct semi-supervised inference on the high-dimensional coefficient vector and the individual prediction under a high-dimensional, dense, and possibly mis-specified risk prediction model.
Our proposed estimation and inference procedures are as follows. For estimation, we first use the labeled data to fit a regularized imputation model with surrogates and high-dimensional covariates; we then impute the missing outcomes for the unlabeled data and fit the risk model using the imputed outcomes and high-dimensional predictors. For inference, we devise a novel bias correction method that corrects the regularization bias from both the imputation and the estimation steps. Our methods allow the fitted risk model for Y | X to be both mis-specified and potentially dense, and only require sparsity of the fitted imputation model for Y | S, X. The sparsity assumption on the imputation model is less stringent since we anticipate that most information on Y can be well captured by the low-dimensional S, while the fitted model for Y | X may be dense, especially under possible model mis-specification.
The remainder of the paper is organized as follows. We introduce our population parameters and model assumptions in Section 2. In Section 3, we propose the SAS estimation method along with its associated inference procedures. In Section 4, we state the theoretical guarantees of the SAS procedures, whose proofs are provided in the Supplementary Materials. We also remark on the sparsity relaxation and the efficiency gain of the SSL.
In Section 5, we present simulation results highlighting finite sample performance of the SAS estimators and comparisons to existing methods. In Section 6, we apply the proposed method to derive individual risk prediction for T2DM using EHR data from MGB.

Settings and Notations
For the i-th observation, Y_i ∈ R denotes the outcome variable, S_i ∈ R^q denotes the surrogates for Y_i, and X_i ∈ R^{p+1} denotes the high-dimensional covariates with the first element being the intercept. Under the SSL setting, we observe n independent and identically distributed (i.i.d.) labeled observations, L = {(Y_i, W_i^T)^T, i = 1, ..., n} with W_i = (X_i^T, S_i^T)^T, and (N − n) i.i.d. unlabeled observations, U = {W_i, i = n + 1, ..., N}. We assume that the labeled subjects are randomly sampled by design and that the proportion of labeled samples is n/N = ρ ∈ (0, 1) with ρ → 0 as n → ∞. We focus on the high-dimensional setting where the dimensions p and q grow with n, allowing p + q to be larger than n. Motivated by our application, our main focus is on the setting where N is much larger than p, but our approach can be extended to N ≤ p under specific conditions.
To predict Y_i with X_i, we consider a possibly mis-specified working conditional mean model with a known monotone and smooth link function g,

E(Y_i | X_i) = g(β^T X_i). (1)

Our procedure allows a wide range of link functions; detailed requirements on g(·) and its anti-derivative G are given in Section 4. In our motivating example, Y is a binary indicator of T2DM status and g(x) = 1/(1 + e^{−x}) with G(x) = log(1 + e^x). Our goal is to accurately estimate the high-dimensional parameter β_0, defined as the solution of the estimating equation

E[X_i{Y_i − g(β^T X_i)}] = 0. (2)

We shall further construct confidence intervals for g(β_0^T x_new) for any x_new ∈ R^{p+1}. The predicted outcome g(β_0^T x_new) is the conditional mean of Y given X_i = x_new when (1) holds, and can be interpreted as the best generalized linear prediction with link g even when (1) fails to hold. To enable estimation of β_0 under possible model mis-specification, we define a pseudo log-likelihood (PL) function ℓ(y, β^T X_i) = y β^T X_i − G(β^T X_i), so that ∂ℓ(y, β^T X_i)/∂β = X_i{y − g(β^T X_i)} corresponds to the moment condition (2). We make no assumption on the sparsity of β_0, and hence it is not feasible to perform valid supervised learning for β_0 when s_β = ||β_0||_0 > n.
We shall derive an efficient SSL estimator for β_0 by leveraging U. To this end, we fit a working imputation model

E(Y_i | W_i) = g(γ^T W_i), (4)

with the population parameter γ_0 defined as the solution of

E[W_i{Y_i − g(γ^T W_i)}] = 0. (5)

Since X_i is a sub-vector of W_i, the definition of γ_0 guarantees E[X_i{Y_i − g(γ_0^T W_i)}] = 0, and hence if we impute Y_i by Ȳ_i = g(γ_0^T W_i), then E[X_i{Ȳ_i − g(β_0^T X_i)}] = 0 regardless of the adequacy of the imputation model (4). It is thus feasible to carry out an SSL procedure by first deriving an estimate for Ȳ_i using the labeled data L and then regressing the estimated Ȳ_i against X_i using the whole data L ∪ U. Although we do not require β_0 to be sparse or any of the fitted models to hold, we do assume that γ_0 defined in (5) is sparse. When the surrogates S are strongly predictive of the outcome, the sparsity assumption on γ_0 is reasonable since the majority of the information in Y can be captured by S.
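To make the moment-matching argument explicit: the X-coordinates of the defining equation (5) give E[X_i{Y_i − g(γ_0^T W_i)}] = 0, so with Ȳ_i = g(γ_0^T W_i),

E[X_i{Ȳ_i − g(β_0^T X_i)}] = E[X_i{Y_i − g(β_0^T X_i)}] − E[X_i{Y_i − g(γ_0^T W_i)}] = 0 − 0 = 0,

where the first expectation vanishes by the definition (2) of β_0. The imputed outcomes therefore solve the same population estimating equation as the true outcomes.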
Notations. We focus on the setting where min{n, p + q, N} → ∞; for convenience, we write n → ∞ in the asymptotic analysis. For two sequences of random variables A_n and B_n, we use A_n = O_p(B_n) and A_n = o_p(B_n) to denote lim_{c→∞} lim_{n→∞} P(|A_n| ≥ c|B_n|) = 0 and lim_{c→0} lim_{n→∞} P(|A_n| ≥ c|B_n|) = 0, respectively. For two positive sequences a_n and b_n, a_n = O(b_n) or b_n ≳ a_n means that there exists C > 0 such that a_n ≤ C b_n for all n; a_n ≍ b_n if a_n = O(b_n) and b_n = O(a_n); and a_n ≪ b_n or a_n = o(b_n) if lim sup_{n→∞} a_n/b_n = 0. We use Z_n →^L N(0, 1) to denote that the sequence of random variables Z_n converges in distribution to a standard normal random variable.

SAS Estimation of β_0
The SAS estimation procedure for β_0 consists of two key steps: (i) fitting the imputation model to L to obtain the estimate γ̂ of γ_0 defined in (5); and (ii) estimating β_0 in (2) using the imputed outcomes over the full data. In Step (i), we estimate γ_0 by the L_1-regularized PL estimator γ̂, defined as

γ̂ = argmin_γ ℓ̂_imp(γ) + λ_γ ||γ_{−1}||_1, (7)

where a_{−1} denotes the sub-vector of all the coefficients except for the intercept and

ℓ̂_imp(γ) = n^{−1} Σ_{i=1}^n {G(γ^T W_i) − Y_i γ^T W_i}. (8)

The imputation loss (8) corresponds to the negative log-likelihood when Y is binary, the imputation model holds, and g is the anti-logit. With γ̂, we impute the unobserved outcomes for subjects in U as Ŷ_i = g(γ̂^T W_i), for n + 1 ≤ i ≤ N. In Step (ii), we estimate β_0 by β̂ = β̂(γ̂), defined with the imputed loss

ℓ̂†(β; γ) = N^{−1} Σ_{i=1}^N {G(β^T X_i) − Ỹ_i β^T X_i}, where Ỹ_i = Y_i for i ≤ n and Ỹ_i = g(γ^T W_i) for i > n, (9)

as

β̂(γ) = argmin_β ℓ̂†(β; γ) + λ_β ||β_{−1}||_1. (10)

We denote the complete-data PL of the full data as ℓ̂(β) = N^{−1} Σ_{i=1}^N {G(β^T X_i) − Y_i β^T X_i} (11), and define the gradients of the losses (8)-(11) as ℓ̇_imp(γ), ℓ̇†(β; γ) and ℓ̇(β), respectively.
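To fix ideas, the two estimation steps can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: it assumes a binary outcome with the logistic link, uses scikit-learn's L1-penalized logistic regression for both steps, and uses hypothetical penalty levels C_gamma and C_beta (in practice chosen by cross-validation). Fractional imputed outcomes are handled by the standard weighted 0/1 expansion of the logistic loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sas_estimate(X_lab, S_lab, y_lab, X_unl, S_unl, C_gamma=1.0, C_beta=1.0):
    """Two-step SAS point estimation (sketch; binary outcome, logistic link)."""
    # Step (i): sparse imputation model of Y on W = (X, S), labeled data only.
    W_lab = np.hstack([X_lab, S_lab])
    imp = LogisticRegression(penalty="l1", solver="liblinear", C=C_gamma)
    imp.fit(W_lab, y_lab)
    # Impute the unobserved outcomes on the unlabeled data.
    y_imp = imp.predict_proba(np.hstack([X_unl, S_unl]))[:, 1]
    # Step (ii): sparse fit of the (observed + imputed) outcome on X alone,
    # expanding each fractional label y_imp into a weighted 0/1 pair.
    X_all = np.vstack([X_lab, X_unl, X_unl])
    y_all = np.concatenate([y_lab, np.ones_like(y_imp), np.zeros_like(y_imp)])
    w_all = np.concatenate([np.ones(len(y_lab)), y_imp, 1.0 - y_imp])
    risk = LogisticRegression(penalty="l1", solver="liblinear", C=C_beta)
    risk.fit(X_all, y_all, sample_weight=w_all)
    return imp, risk
```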

SAS Inference for Individual Prediction
Since g(·) is specified, inference on g(x_new^T β) follows immediately from inference on x_new^T β. We shall consider inference on the standardized linear prediction x_std^T β with the standardized covariates x_std = x_new/||x_new||_2, and then scale the confidence interval back. This way, the scaling with ||x_new||_2 is made explicit in the expression of the confidence interval.
The estimation error of β̂ can be decomposed into two components corresponding to the respective errors associated with (7) and (9). Specifically, we write

β̂ − β_0 = {β̂ − β̄(γ̂)} + {β̄(γ̂) − β_0},

where β̄(γ̂) is defined as the minimizer of the expected imputed loss conditional on the labeled data, that is, β̄(γ) = argmin_β E{ℓ̂†(β; γ) | L}. The term β̄(γ̂) − β_0 captures the error from the imputation model in (7), while the term β̂ − β̄(γ̂) captures the error from the prediction model in (9) given the imputation model parameter γ̂. As L_1 penalization is involved in both steps, we shall correct the regularization bias from the two sources. The correction for β̂ − β̄(γ̂) follows the typical one-step debiased LASSO (Zhang and Zhang, 2014), based on a projection direction u_0 constructed from the inverse Hessian of ℓ†(·; γ̂) at β = β_0. The bias correction for β̄(γ̂) − β_0 requires some innovation, since we need to correct the bias of a nonlinear functional β̄(·) of the LASSO estimator γ̂, which has not been studied in the literature. We identify β̄(γ̂) and β_0 by the first-order moment conditions

E_{i>n}[X_i{g(γ̂^T W_i) − g(β̄(γ̂)^T X_i)} | L] = 0 and E[X_i{g(γ_0^T W_i) − g(β_0^T X_i)}] = 0. (15)

Here E_{i>n}[· | L] denotes the conditional expectation of a single copy of the unlabeled data given the labeled data. Equating the two estimating equations in (15), we apply a first-order approximation to the difference β̄(γ̂) − β_0; see the expansion sketched below. Together with the bias correction for β̂ − β̄(γ̂), this motivates the debiasing procedure in (22). The 1 − ρ factor appearing there, which tends to one when n is much smaller than N, comes from the proportion of unlabeled data whose missing outcomes are imputed.
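To sketch the expansion (the exact remainder control is deferred to the theory in Section 4), write H_β = E{g′(β_0^T X_i) X_i X_i^T} for the population Hessian of the imputed loss. Implicit differentiation of the two moment conditions in (15) with respect to γ gives the first-order approximation

β̄(γ̂) − β_0 ≈ H_β^{−1} E{g′(γ_0^T W_i) X_i W_i^T} (γ̂ − γ_0),

which is linear in γ̂ − γ_0 and can therefore be debiased by projecting the imputation score in the direction induced by u_0.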
For theoretical considerations, we devise a cross-fitting scheme in our debiasing process.
We split the labeled and unlabeled data into K folds of approximately equal size, respectively; the number of folds does not grow with the dimension (e.g., K = 10). We denote the index sets for each fold of the labeled data L as I_1, ..., I_K, and those of the unlabeled data U as J_1, ..., J_K. We denote the respective sizes of each fold in the labeled data and the full data as n_k = |I_k| and N_k = n_k + |J_k|, where |I| denotes the cardinality of I. Define I_k^c = {1, ..., n} \ I_k and J_k^c = {n + 1, ..., N} \ J_k; see the fold-construction sketch below. For each labeled fold k, we fit the imputation model with the out-of-fold labeled samples, yielding γ̂^(k) as in (17). Using γ̂^(k), we fit the prediction model with the out-of-fold data I_k^c ∪ J_k^c, yielding β̂^(k) as in (18). To estimate the projection direction u_0 defined in (19), we propose an L_1-penalized estimator û^(k) in (20), where β̂^(k,k') is trained with the samples outside folds k and k', with γ̂^(k,k') defined analogously in (21). The estimators in (21) take similar forms to those in (17) and (18), except that their training samples exclude the two folds of data I_k ∪ J_k and I_{k'} ∪ J_{k'}; accordingly, the summand of (20) for fold k' is evaluated with estimators trained without folds k and k'. The estimation of u_0 requires an estimator of β, and both estimators are subsequently used for the debiasing step. Using the same set of data multiple times for β̂, û, debiasing, and variance estimation may induce over-fitting bias, so we implement the cross-fitting scheme to reduce this bias. As a remark, cross-fitting might not be necessary for the theory under additional assumptions and/or empirical process techniques.
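The fold construction itself is elementary; a minimal sketch (a hypothetical helper, shown here with K = 5 and the sample sizes from our simulation study) is:

```python
import numpy as np

def split_folds(n, N, K=10, seed=0):
    """Split labeled indices {0,...,n-1} and unlabeled indices {n,...,N-1}
    into K folds of approximately equal size, as in the cross-fitting scheme."""
    rng = np.random.default_rng(seed)
    I = np.array_split(rng.permutation(n), K)                 # labeled folds I_1..I_K
    J = np.array_split(rng.permutation(np.arange(n, N)), K)   # unlabeled folds J_1..J_K
    return list(zip(I, J))

folds = split_folds(n=500, N=20_000, K=5)
for k, (I_k, J_k) in enumerate(folds):
    I_out = np.concatenate([I_j for j, (I_j, _) in enumerate(folds) if j != k])
    J_out = np.concatenate([J_j for j, (_, J_j) in enumerate(folds) if j != k])
    # fit the imputation model on I_out, the prediction model on I_out and J_out,
    # and evaluate the debiasing terms on the held-out fold (I_k, J_k).
```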
We obtain the cross-fitted debiased estimator for x_std^T β_0, denoted by x_std^T β̃ and defined in (22). The second term of (22) corrects the bias β̄(γ̂) − β_0 and the third term corrects the bias β̂ − β̄(γ̂). The corresponding variance estimator V̂²_SAS is given in (23). Through the link g and the scaling factor ||x_new||_2, we estimate g(x_new^T β_0) with the confidence interval in (24), where z_{α/2} is the 1 − α/2 quantile of the standard normal distribution.
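Because g is monotone, the interval endpoints on the linear scale map directly through the link; a small sketch follows (the standard error `se` is assumed to have already been computed from the variance estimator in (23), whose exact normalization we omit here):

```python
import numpy as np
from scipy.stats import norm

def risk_ci(lin_pred, se, alpha=0.05, g=None):
    """Wald CI for x_new^T beta_0 mapped through a monotone link g."""
    if g is None:
        g = lambda t: 1.0 / (1.0 + np.exp(-t))   # anti-logit, as in the motivating example
    z = norm.ppf(1 - alpha / 2)                  # z_{alpha/2}
    return g(lin_pred - z * se), g(lin_pred + z * se)

# e.g., a debiased linear prediction of -1.2 with standard error 0.3:
print(risk_ci(-1.2, 0.3))  # CI for the predicted risk g(x_new^T beta_0)
```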

Theory
We introduce assumptions required for both estimation and inference in Section 4.1. We state our theories for estimation and inference, respectively in Sections 4.2 and 4.3.

Assumptions
We assume the complete data consist of i.i.d. copies of (Y_i, X_i, S_i), for i = 1, ..., N. In our SSL setting, only the first n outcome labels Y_1, ..., Y_n are observed. Under the i.i.d. assumption, our SSL setting is equivalent to the missing completely at random (MCAR) assumption. The sparsities of γ_0, β_0 and u_0 are denoted by s_γ = ||γ_0||_0, s_β = ||β_0||_0 and s_u = ||u_0||_0. We focus on the setting with n, p + q, N → ∞, with n allowed to be smaller than p + q.
We allow s_γ, s_β and s_u to grow with n, p + q and N, subject to s_γ ≪ n and s_β + s_u ≪ N.
To achieve sharper dimension conditions, we consider the sub-Gaussian design as in Portnoy (1984, 1985) and Negahban et al. (2010). We denote the sub-Gaussian norm for random variables and random vectors both by ||·||_{ψ_2}; the detailed definition is given in Appendix D.
Assumption 1. For constants ν_1, ν_2 and M independent of n, p and N: a) [conditions on the outcome residuals in terms of ν_1 and ν_2]; b) the link function g satisfies the monotonicity and smoothness conditions inf_{x∈R} g′(x) ≥ 0, sup_{x∈R} g′(x) < M and sup_{x∈R} g″(x) < M.
Under our motivating example with a binary Y_i and g(x) = e^x/(1 + e^x), Assumptions 1a and 1b are satisfied. The conditions are also satisfied for the probit link and the identity link. Condition 1a is universal for high-dimensional regression. Admittedly, the Lipschitz requirement in 1b rules out some GLM links with unbounded derivatives, such as the exponential link, but we may substitute the condition by assuming a bounded ||X_i||_∞.
Assumption 2. For constants σ²_max and σ²_min independent of n, p and N: a) W_i is a sub-Gaussian vector with sub-Gaussian norm ||W_i||_{ψ_2} ≤ σ_max/√2; b) the weak overlapping condition holds at the population parameters β_0 and γ_0; c) the average residual variance is non-degenerate. Assumption 2a is typical for high-dimensional regression (Negahban et al., 2010) and implies a bounded maximal eigenvalue of the second moment E(W_i W_i^T). Notably, we do not require two conditions common in high-dimensional generalized linear models (Huang and Zhang, 2012; van de Geer et al., 2014): 1) an upper bound on sup_{i=1,...,N} ||X_i||_∞; and 2) a lower bound on inf_{i=1,...,N} g′(β_0^T X_i), often known as the overlapping condition for the logistic regression model. Compared to the overlapping condition under logistic regression, namely that g′(β_0^T X_i) and g′(γ_0^T W_i) are bounded away from zero, our Assumptions 2b and 2c are weaker because they are implied by the typical minimal eigenvalue condition plus the overlapping condition.

Consistency of the SAS Estimation
We now state the L_2 and L_1 convergence rates of our proposed SAS estimator.
For the setting N ≫ n, our requirement on the sparsity of β_0, s_β = o(N/log(p)), is significantly weaker than s_β = o(n/log(p)), which is known as the fundamental sparsity limit for identifying the high-dimensional regression vector in the supervised setting. Theorem 1 indicates that, with the assistance of the surrogates S observed in U, the SAS procedure allows s_β > n provided that N is sufficiently large and the imputation model is sparse. This distinguishes our result from most estimation results in high-dimensional supervised settings.
Second, we briefly discuss the L_1 consistency. If L_1 consistency is of interest, the penalty levels are chosen as λ_γ ≍ √(log(p+q)/n) and λ_β ≍ √(log(p)/N) ∨ (s_γ/s_β)λ_γ, which produces the L_1 estimation rate from Theorem 1. Compared to the condition for L_1 consistency under supervised learning, s_β = o(n/log(p)), the condition from SAS estimation, s_β = o((n/s_γ + N)/log(p)), allows a denser β_0 in settings with a very sparse γ_0 and a large unlabeled dataset. On the other hand, the L_2 estimation rate in Theorem 1 remains the same if √(log(p)/N) ≲ λ_β ≲ max{√(log(p)/N), (s_γ/s_β)λ_γ}.
We shall point out that our subsequent theory on the SAS inference procedure is based on the L_2 consistency, instead of the L_1 consistency.
Theorem 1 implies the following prediction consistency result.
Corollary 2 (Consistency of individual prediction). Suppose x_new is a sub-Gaussian random vector; then, under the conditions of Theorem 1, the prediction g(x_new^T β̂) is consistent for g(x_new^T β_0). The concentration result of Corollary 2 is established with respect to the joint distribution of the data and the new observation x_new. This is in sharp contrast to individual prediction conditioning on a given new observation x_new. If the goal is to conduct inference for any given x_new, the theoretical justification is provided in the following Theorem 3 and Corollary 4.

√n-Inference with the Debiased SAS Estimator
We state the validity of our SSL inference in Theorem 3. We use A →^L B to denote that the random variable A converges in distribution to the distribution B.
Theorem 3 (SAS Inference). Let x_new be the random vector representing the covariates of a new individual. Under Assumptions 1 and 2 and the dimension condition (28), the standardized debiased estimator is asymptotically normal, where V̂²_SAS defined in (23) is the estimator of the asymptotic variance V²_SAS in (29). By Young's inequality, the condition (28) is implied by (30). When p is much smaller than the full sample size N, our condition (30) allows the sparsity levels of β_0 and u_0 to be as large as p. Even if p is larger than N, our SAS inference procedure is valid if s_β + s_u ≪ √N/log(p). In the literature on confidence interval construction in the high-dimensional supervised setting, valid inference for a single regression coefficient in linear regression requires s_β ≪ √n/log(p) (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014). Such a sparsity condition has been shown to be necessary for constructing a confidence interval of parametric rate (Cai et al., 2017). We have leveraged the unlabeled data to significantly relax this fundamental limit of statistical inference from s_β ≪ √n/log(p) to s_β ≪ √N/log(p). The amount of unlabeled data thus validates statistical inference for a dense model in high dimensions.
The sparsity of u_0 is determined by x_new and the precision matrix Θ_0. In the supervised learning setting, confidence interval construction for a single regression coefficient in van de Geer et al. (2014) requires s_u ≪ n/log(p). According to (30), our SAS inference requires s_u ≪ √N/log(p), which is weaker than s_u ≪ n/log(p) whenever the amount of unlabeled data is larger than n². Theorem 3 implies that our proposed CI in (24) is valid in terms of coverage, which is summarized in the following corollary.
Corollary 4. Under Assumptions 1 and 2, as well as (28), the CI defined in (24) attains nominal coverage asymptotically, where V_SAS is the asymptotic variance defined in (29).
Confidence interval construction for g(x_new^T β_0) in the high-dimensional supervised setting has recently been studied by Guo et al. (2020), who assume the prediction model to be a correctly specified high-dimensional sparse logistic regression; their inference procedure is valid if s_β ≪ √n/log p. In contrast, we leverage the unlabeled data to allow for a mis-specified prediction model and a dense regression vector, as long as the dimension requirement in (28) is satisfied.

Efficiency comparison of SAS Inference
Efficiency in the high-dimensional or SSL setting, in which the proportion of labeled data decays to zero, is yet to be formalized. Here we use the efficiency bound in the classical low-dimensional setting with a fixed ρ as the benchmark. Apart from the relaxation of various sparsity conditions, we illustrate next that our SAS inference achieves decent efficiency with a properly specified imputation model, compared to both supervised learning and the benchmark.
Similar to the phenomenon discovered by Chakrabortty and Cai (2018), if the imputation model is correct, we can guarantee an efficiency gain of SAS inference relative to the asymptotic variance of supervised learning. Moreover, we can show that our SAS inference attains the benchmark efficiency derived from the classical fixed-ρ setting (Tsiatis, 2007). To simplify the derivation, we describe the missing-completely-at-random mechanism through the binary observation indicator R_i = I(i ≤ n), and we still denote the proportion of labeled data by ρ = P(R_i = 1). The unsorted data take the form D_i = (R_i, R_i Y_i, W_i^T)^T. We consider a class of complete-data semi-parametric models M_comp and establish the efficiency bounds for regular and asymptotically linear (RAL) estimators under M_comp by deriving the associated efficient influence function in the following proposition. We denote the nuisance parameters for f_{Y|S,X}, f_X and f_{S|X} by η, and use η_0 to denote the true underlying nuisance parameter that generates the data. The parameter of interest β_0 is not part of the model M_comp but is defined implicitly through the moment condition (2).
Under the assumptions of Theorem 3 and additionally E(Y_i | S_i, X_i) = g(γ_0^T W_i), our SAS debiased estimator admits the same influence function; see Step 2 and (A.31).

Simulation
We have conducted extensive simulation studies to evaluate the finite sample performance of the SAS estimation and inference procedures under various scenarios. Throughout, we let p = 500, q = 100, N = 20000 and consider n = 500. The signals in β are varied to be approximately sparse or fully dense with a mixture of strong and weak signals. The surrogates S are either moderately or strongly predictive of Y, as specified below. For each configuration, we summarize the results based on 500 simulated datasets.
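A skeleton of the simulation loop, with the dimensions above, looks as follows. The body of generate_data is a purely hypothetical stand-in: the actual generating equations of Scenarios I and II are those described below and in the Supplementary Materials.

```python
import numpy as np

p, q, N, n, N_REP = 500, 100, 20_000, 500, 500  # dimensions from the text

def generate_data(rng):
    """Hypothetical stand-in for the Scenario I/II generating mechanisms;
    returns covariates X (N x p), surrogates S (N x q) and outcomes Y."""
    Z, Zu = rng.standard_normal((N, p)), rng.standard_normal((N, 1))
    X = np.where(Z + Zu > 1.0, Z + Zu, 0.0)        # illustrative zero-inflated covariates
    S = X[:, :q] + rng.standard_normal((N, q))     # illustrative surrogates
    risk = 1 / (1 + np.exp(-(X[:, :3].sum(axis=1) + S[:, 0])))
    return X, S, rng.binomial(1, risk)

rng = np.random.default_rng(0)
for rep in range(N_REP):
    X, S, Y = generate_data(rng)
    Y_observed = Y[:n]      # only the first n outcomes are labeled
    # ... run SAS and SLASSO, record bias, rMSE, coverage and AUC ...
```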
To mimic the zero-inflated discrete distribution of EHR features, we first generate the raw covariates from a latent model with a shared factor Z_i^u, and then standardize X_{i,j} to be roughly mean zero with unit variance, using μ_X = 1.80 and σ_X = 2.74.
The shared term Z u i induces correlation among the covariates. For S and Y , we consider two scenarios under which the imputation model is either correctly or incorrectly specified. We present the "Scenario I: neither the risk prediction model nor the imputation model is correctly specified" in the main text and the "Scenario II: The imputation model is correctly specified and exactly sparse" in Section A of the Supplementary materials.
Scenario I: neither the risk prediction model nor the imputation model is correctly specified. In this scenario, we first generate Y_i from a probit model and then generate S given (Y, X). We choose μ_S and σ_S depending on α such that S_{i,1} is roughly mean 0 and variance 1.
Under this setting, a logistic imputation model is misspecified but nevertheless approximately sparse with appropriately chosen ξ. The coefficients α control the optimal prediction accuracy of X for Y, while θ controls the optimal prediction accuracy of S for Y. We consider two choices of α with different sparsity patterns, a sparse setting (s_α = 3) and a dense setting, which also determine the rest of the parameters; here a_{k×1} = (a, ..., a)^T denotes a length-k vector with identical entries. The sparsity of α subsequently affects the approximate sparsity of β (Table 1), which we measure by the squared ratio between the L_1 norm and the L_2 norm, S(β_0) = (||β_0||_1/||β_0||_2)². We consider two values of θ: (a) θ = 0.6, for S moderately predictive of Y; and (b) θ = 1, for strong surrogates. The parameter ξ depends on the choices of both α and θ. Due to the complexity of the data generating process and the non-collapsibility of logistic regression models, we cannot analytically express the true β_0 in either scenario.
Instead, we numerically evaluate β_0 with a large simulated dataset, using the oracle knowledge of the exchangeability among covariates under the generating model. We report the simulation settings under Scenario I in Table 1, where we present the predictive power of the oracle estimation and the LASSO estimation. We also report the average area under the receiver operating characteristic (ROC) curve (AUC) for the oracle β_0, the supervised LASSO (SLASSO) and the proposed SAS estimation. Our SAS estimation achieves a better AUC than SLASSO across all settings and is comparable to the AUC attained with the true coefficient β_0. Moreover, we observe that the AUC of SLASSO is sensitive to the approximate sparsity S(β_0), while the AUC of the SAS estimation does not appear to be affected by S(β_0).
To evaluate the SAS inference for individualized prediction, we consider six different x_new: three drawn from a random sample of x_new generated from the distribution of X_i such that their predicted risks are around 0.2, 0.5, and 0.7, corresponding to low, moderate and high risk (x_new^L, x_new^M, x_new^H); and three deterministic x_new with different levels of sparsity, labelled sparse (x_new^S), intermediate and dense. In Table 2, we compare our SAS estimator of x_new^T β_0 with the corresponding SLASSO across all settings under Scenario I. The root mean squared error (rMSE) of the SAS estimation decays proportionally with the sample size, while the rMSE of the supervised LASSO provides evidence of inconsistency for the intermediate and dense deterministic x_new. The bias of the supervised LASSO is also significantly larger than that of the SAS estimation. The performance of the SAS estimation is insensitive to the sparsity of β_0, while that of the supervised LASSO severely deteriorates with a dense β_0. The improvement of the SAS estimation over the supervised LASSO is regulated by the surrogate strength.
In Table 3, we compare our SAS inference with the supervised debiased LASSO across the settings under Scenario I. Our SAS inference procedure attains approximately honest coverage of 95% confidence intervals for all types of x_new under all settings. Unsurprisingly, the debiased SLASSO under-covers for the deterministic x_new, as a consequence of the violation of the sparsity assumptions on β_0 and the precision matrix. Under our design, the first covariate X_1 has the strongest dependence on the other covariates, so its associated row in the precision matrix is the densest; consequently, the inference for x_new^{S,T} β_0, the sum of the intercept and the coefficient of X_1, is affected the most. The debiased SLASSO has acceptable coverage for the random x_new^L, x_new^M, x_new^H sampled from the covariate distribution despite the presence of substantial bias, which we attribute to the even larger variance that dominates the bias. In contrast, our SAS inference has small bias across all settings and improved variance from the strong surrogate.
According to Tables A1, A2 and A3 in Appendix A, the results under Scenario II are consistent with our findings under Scenario I.

Application to T2DM Risk Prediction with MGB EHR Data

We aim to develop a risk prediction model for Y by fitting a working model P(Y = 1 | X) = g(β_0^T X), where the baseline covariate vector X includes age, gender, indicators for the occurrence of ICD codes and NLP counts for obesity, hypertension, coronary artery disease (CAD) and hyperlipidemia during the first year window, as well as a total of 49 single nucleotide polymorphisms (SNPs) previously reported as associated with T2DM in Mahajan et al. (2018) with odds ratios greater than 1.1. We additionally adjust for follow-up by including log(C) and allow for non-linear effects by including two-way interactions between the SNPs and the other baseline covariates. All variables with fewer than 10 nonzero values within the labeled set are removed, resulting in final covariates of dimension p = 260.
We standardize the covariates to have mean 0 and variance 1. To impute the outcome, we use the predicted probability of T2DM derived from the unsupervised phenotyping method MAP (Liao et al., 2019), which achieves an AUC of 0.98, indicating a strong surrogate. In addition to the proposed SAS procedure, we derive risk prediction models based on the supervised LASSO with the same set of covariates. We let K = 5 in the cross-fitting and use 5-fold cross-validation for tuning parameter selection. To compare the performance of different risk prediction models, we use 10-fold cross-validation to estimate the out-of-sample AUC; we repeat the process 10 times and take the average of the predicted probabilities across repeats for each labeled sample and each method in comparison.
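A compact sketch of this evaluation loop (the callable fit_predict is a hypothetical wrapper around any of the compared methods):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def repeated_cv_auc(fit_predict, X, y, n_folds=10, n_repeats=10, seed=0):
    """Out-of-sample AUC via repeated 10-fold CV; predicted probabilities are
    averaged across repeats before a single AUC is computed."""
    probs = np.zeros((n_repeats, len(y)))
    for r in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for tr, te in cv.split(X, y):
            probs[r, te] = fit_predict(X[tr], y[tr], X[te])
    return roc_auc_score(y, probs.mean(axis=0))
```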
In Figure 2, we present the estimated β coefficients for the covariates that received a p-value less than 0.05 from the SAS inference. The confidence intervals are generally narrower from the SAS inference. For the coefficients of baseline age and follow-up time, both of which are expected to have a positive effect on T2DM onset status during the observation window, the SAS inference produced much narrower confidence intervals than the debiased SLASSO. In addition, the SAS inference identified one global genetic risk factor and six other subgroup genetic risk factors, while SLASSO identified none of these.
In Table 4, we present the AUCs of the estimated risk prediction models using the high-dimensional X. It is important to note that the AUC is a measure of prediction accuracy, so debiasing might lead to a worse AUC by accepting larger variability in exchange for reduced bias. The AUC from SLASSO is very poor, probably due to over-fitting given the small sample size of the labeled set. With the information from the large unlabeled data, SAS produced a significantly higher AUC than SLASSO.
For illustration, we present in Figure 3 the individual risk predictions with 95% confidence intervals for three sets of 10 patients, with each set randomly selected from the low (< 5%), medium (5% ∼ 15%) or high (> 15%) risk subgroups, defined by thresholding the predicted risks at 5% and 15%.

Supplementary Material
We present the simulation Scenario II, in which the imputation model is correctly specified and exactly sparse, in Appendix A. The proofs of Theorems 1 and 3, Corollary 2 and Propositions 5 and 6 are given in Appendix B. The technical details are given in Appendix C.
Definitions and existing results are stated in Appendix D.

A Additional Simulation
Scenario II: the imputation model is correctly specified and exactly sparse. In the second scenario, we first generate S_i given X_{i,j}, j = 1, ..., p, and then generate Y_i from a sparse model. We chose μ_S ≈ 0.66 and σ_S ≈ 1 such that S_{i,1} is roughly mean 0 and variance 1. Under this setting, the imputation model holds with s_γ = 1. The factor ν and the coefficients α control the predictiveness of X for S_1 and Y, while θ controls the predictiveness of W for Y. We consider two choices of α with different sparsity patterns, where a_{k×1} = (a, ..., a)^T. Similar to Scenario I, the sparsity of α regulates the approximate sparsity of β measured by (33) (see Table 1). We consider two sets of (ν, θ) to allow W to be either moderately or strongly predictive of Y: moderate, ν = 0.4, θ = 2; and strong, ν = 0.6, θ = 3.7.
The layouts of Tables A2 and A3 differ from those of Tables 2 and 3 because of the different data generating mechanisms: the distribution of Y_i | X_i is not affected by the distribution of S_i in Scenario I, while this property does not hold in Scenario II.

B Proofs of Main Results
We first summarize below the notations used in Section 3 for the conditional expectations given different parts of the data.
Definition A1. The conditional expectation over the samples with indices in a set S, conditionally on a subset D of the data, is denoted by E_{i∈S}{· | D}. We denote the conditional expectation of the unlabeled data given the labeled data by E_{i>n}{f(W_i) | L}, and the conditional probability of a new copy of the data given the current data by P_new{f(W_i) | D}.
With L and U partitioned into K folds indexed respectively by {I_k, k = 1, ..., K} and {J_k, k = 1, ..., K}, we denote the conditional expectations of the fold-k labeled and unlabeled data given the out-of-fold data by E_{i∈I_k}{· | D_{I_k^c}} and E_{i∈J_k}{· | D_{J_k^c}}, respectively.

B1 Proof of Theorem 1

Oracle inequality. We decompose the estimation error of β̂ into a variance term and a bias term, as in (A.1), and discuss which part dominates the estimation error. When the first (variance) term in (A.1) is dominant, the bias from γ̂ becomes negligible; then we recover the usual error bound for LASSO as if γ_0 were used. When the second (bias) term in (A.1) is dominant, the error bound of β̂ can be controlled by the error bound of γ̂. Combining the error bounds in the two cases, we obtain the oracle inequalities.
Lemma A1. On the stated event, we have the oracle inequalities for the estimation error of β̂, where the constants κ_{rsc,1}, κ_{rsc,2} are the restricted strong convexity parameters specified in Lemma A6.
Case 1: the estimation error is dominated by γ̂ − γ_0. In this case, we simply have the corresponding bound from (A.11), and the error of β̂ is controlled by that of γ̂. Case 2: if Case 1 does not hold, then instead the estimation error is comparable to that obtained when the true γ_0 is used for the imputation, and the sparsity of β_0 may affect the estimation error.
Following the typical approach to establishing the cone condition for δ, we analyze the symmetrized Bregman divergence (A.14). Due to the convexity of the loss ℓ̂†(·; γ̂) under Assumption 1b, the symmetrized Bregman divergence (A.14) is nonnegative by a mean-value theorem (A.15). Denote the index set of the nonzero coefficients of β_0 as O_β = {j : β_{0,j} ≠ 0}, and denote by δ_{O_β} and δ_{O_β^c} the sub-vectors of δ at positions in O_β and outside O_β, respectively. The solution β̂ satisfies the KKT condition of the penalized problem. From the KKT condition and the definitions of δ and O_β, applying (A.15) to (A.14), we obtain an upper bound on the divergence. We then apply (A.10), the definition of λ_β and (A.13), from which we can bound the L_1 norm of δ by the cone property (A.16). Applying the cone condition (A.16) and the case condition (A.13) to the bound (A.11) yields tκ_{rsc,1} ≤ 14√s_β λ_β and tκ_{rsc,1}||δ||_1 ≤ 84 s_β λ_β. Thus, we obtain the rates ||β̂ − β_0||_2 ≤ 14√s_β λ_β/κ_{rsc,1} and ||β̂ − β_0||_1 ≤ 84 s_β λ_β/κ_{rsc,1}. (A.17) Since Case 1 and Case 2 are complements of each other, one of them must occur.
Thus, the bound of estimation error is controlled by the larger bound in the two cases, which is our oracle inequality in Lemma A1.
Consistency. We next show that the oracle inequality leads to consistency under the dimension condition (26). To bound ||ℓ̇†(β_0; γ̂)||_∞, we express the term of interest as a sum of empirical processes. Under Assumptions 1a and 2a, X_i and g(β_0^T X_i) − Y_i are sub-Gaussian. According to Lemma A8, the event ||γ̂ − γ_0||_2 ≤ 1 occurs with high probability, on which we obtain from (A.18) that Y_i − g(γ̂^T W_i) is sub-Gaussian with high probability.
Thus, by the properties of sub-Gaussian random variables in Lemmas A4-d and A4-f, the elements of the summands of ℓ̇†(β_0; γ̂) are all sub-exponential random variables conditionally on the labeled data. We apply Bernstein's inequality (Lemma A4-h) conditionally on the labeled data, which establishes the order of λ_β. Setting λ_β ≍ √(log(p)/N) for optimal L_2 estimation, we achieve the stated conclusion by applying the rates from Lemma A8 and (A.19). For optimal L_1 estimation, we set a larger penalty, λ_β ≍ √(log(p)/N) ∨ (s_γ/s_β)√(log(p+q)/n), to achieve the L_1 rate.

B2 Proof of Corollary 2
Under Assumption 1b, we have the Lipschitz bound (A.20). The tail is controlled by the sub-Gaussian norm via Lemma A4-a, giving (A.21). Combining (A.20) and (A.21), we obtain the stated conclusion.

B3 Proof of Theorem 3
Our proof is organized in five parts. In Part 1, we establish the consistency of the cross-fitted estimator of the precision matrix projection, namely ||û^(k) − u_0||_2 = o_p(1), with û^(k) and u_0 defined in (20) and (19), respectively. In Part 2, we show that the debiased estimator can be approximated by an empirical process; as long as the asymptotic variance V_SAS defined in (29) is bounded and bounded away from zero, the asymptotic normality of the leading term follows from the central limit theorem. In Part 3, we handle the asymptotic variance V_SAS and the consistency of the variance estimator V̂_SAS defined in (23). In Part 4, we reach the conclusion of the theorem based on the estimation rates in (A.22) holding for all 1 ≤ k ≤ K. Following Part 4, we show in Part 5 that (28) implies (A.22).

Part 1: Consistency of estimated precision matrix
The definitions of u_0 and û^(k) are given in (19) and (20). In this part, we establish the consistency of û^(k). Since the number of folds K ≤ 10 is finite, the estimation rate applies to û^(k) for all k = 1, ..., K.
We denote the components of the quadratic loss function in (20) and their derivatives by m̂^(k,k')(·; β̂^(k,k')) and ṁ^(k,k')(·; β̂^(k,k')) for k' ∈ {1, ..., K} \ {k}, and we may express (20) accordingly. Similar to the proof of Theorem 1, we establish the estimation rate for û^(k) through an oracle inequality. Lemma A2. Under Assumptions 1 and 2, on the event Ω^(k) we have the oracle inequality for the estimation error of û^(k). The constants κ*_{rsc,1}, κ*_{rsc,2} are the restricted strong convexity parameters specified in Lemma A7.
The proof of Lemma A2 parallels the proof of the oracle inequality for Theorem 1, so we defer the details to Section C.
To use Lemma A2 for the estimation rate of û^(k), we only need to verify two conditions: first, that the event Ω^(k) occurs with probability tending to one; and second, that the oracle choice of λ_u is of order √(log(p)/N). Repeating Theorem 1 for each β̂^(k,k'), we have the corresponding rates under (28). Then, by Lemma A7, the events whose intersection forms Ω^(k) each occur with probability tending to one. Since the number of folds K ≤ 10 is finite, we can take a union bound to conclude that Ω^(k) occurs with probability tending to one.
We may write ṁ^(k,k')(u_0; β̂^(k,k')) as the decomposition (A.24), each element of which is an empirical process. Under Assumptions 1b and 2a, each summand is a sub-exponential random variable by Lemmas A4-e and A4-f. Hence, we can apply Bernstein's inequality, and using the fact that N_k ≍ N, we obtain that the oracle λ_u is of order O_p(√(log(p)/N)). Therefore, we can apply Lemma A2 to obtain the stated rate for û^(k).

Part 2: Asymptotic approximation
Under Assumption 2b-i, we also have the tightness of ||û^(k)||_2 from the bound on ||u_0||_2 in (A.25). Define the scores of the in-fold data as ℓ̇†(k)(β; γ), the fold-k analogue of ℓ̇†. Since x_std^T β̃ is the average over K (at most 10) cross-fitted estimators, it suffices to study one of the cross-fitted estimators, given in (A.27).
We denote the expected Hessian matrices of the losses in (A.26) accordingly. Our analysis of the approximation error is based on the first-order mean value theorem identity (A.29), for some β̃ on the path from β̄(γ̂) to β_0 and some γ̃ on the path from γ̂ to γ_0; the conditional expectation notation is given in Definition A1. Based on (A.29), we analyze the approximation error through the terms T_1-T_5, whose rates we state here. With the assumed estimation rate in (A.22), the approximation holds. Using the indicator R_i = I(i ≤ n), we can alternatively write the leading term over the full data. We provide the details for T_1-T_5 in Section C2.

Part 3: Variance estimation
Finally, we show that the asymptotic variance V_SAS defined in (29) is bounded away from both infinity and zero, and that the estimator V̂_SAS defined in (23) is consistent.
By the Cauchy-Schwarz inequality, we have an upper bound for the variance. Under Assumptions 1a and 2a, the relevant variables are sub-Gaussian and sub-exponential, and by the bounds on the moments of sub-Gaussian and sub-exponential random variables stated in Lemma A4-b, the upper bound is finite. Under Assumptions 1b, 2a, 2b-i and 2c, we have a lower bound for V_SAS that is bounded away from zero.
We analyze the estimation error V̂_SAS − V_SAS through a decomposition into the terms T_1-T_4, whose rates we state here. With the assumed estimation rate in (A.22), the consistency follows. We provide the details for T_1-T_4 in Section C2.

Part 4: Conclusion with estimation rates
From the approximation in Part 2 and the boundedness and non-degeneracy of V_SAS in Part 3, we have the asymptotic normality of the cross-fitted debiased estimator. Together with the consistency of V̂_SAS from Part 3, we reach the conclusion of the theorem.

Part 5: Sufficient dimension condition
We have established the rates of estimation for γ̂, β̂ and û from Lemma A8, Theorem 1 and Part 1 of this proof, respectively. Since we only hold out one fold of the data for the cross-fitted estimators, they obey the same rates of estimation. Applying these rates, we show that the dimension assumption (28) is sufficient for (A.22).

B4 Efficiency of SAS Inference
Relative Efficiency to Supervised Learning. Proof of Proposition 5. We prove the proposition by direct calculation. The last expression is a sum of expectations of complete squares, so it must be nonnegative. Thus, the SAS asymptotic variance is no greater than the supervised learning variance. The equality holds only if 1) ρ = 1, i.e., all samples are labeled; or 2) ρ = 0 and a degeneracy condition involving u_0 holds.

Efficiency Bound among Semi-parametric RAL Estimators
Proof of Proposition 6. The proof follows the flow of Section D.2 in Kallus and Mao (2020).
The semi-parametric model for the observed data follows from M_comp and the MCAR mechanism. We consider a parametric sub-model indexed by ζ and compute its score vector; the orthogonal complement of the model tangent space Λ follows. Since β(ζ) is an implicit function of ζ through the moment condition, we solve for its derivative by differentiating the moment condition, and we then verify that the supervised learning influence function φ_SL is a valid influence function. Finally, we derive the efficient influence function by subtracting from φ_SL its projection onto Λ. Let Π(h | Λ) denote the projection of h(D) ∈ H onto the space Λ. We can readily calculate the projection of φ_SL onto Λ_R, and the efficient influence function is thus obtained.

C1 General
Lemma A3. Under Assumptions 1a, 1b and 2a, the residuals of the imputed loss are sub-Gaussian random variables; similar bounds hold for the related residual terms. Proof of Lemma A3. To establish the sub-exponential tail, we consider the decomposition (A.37). According to Assumption 1a, the first two terms on the right-hand side of (A.37) are sub-Gaussian. According to Assumption 1b, the latter two terms on the right-hand side of (A.37) are bounded. Finally, we apply Lemma A4-d. Therefore, we have reached the conclusion.
The remaining bounds may be obtained by the same derivation.

Analysis of Estimated Precision Matrix
Proof of Lemma A2. The definitions of the cross-fitted loss functions m̂^(k,k') and their derivatives can be found at (A.23). By the definition of û^(k), we have the basic inequality. Denote the standardized estimation error as δ = (û^(k) − u_0)/||û^(k) − u_0||_2. By the convexity of the loss function and the triangle inequality ||u_0||_1 − ||u_0 + tδ||_1 ≤ t||δ||_1, we obtain (A.39) from (A.38). Because the loss functions m̂^(k,k') are quadratic in u, we can apply the restricted strong convexity event Ω^(k) to obtain (A.40). Applying (A.40) to (A.39), we have with high probability a bound in which ||δ||_2 = 1 by definition. The target parameter u_0 is identified by E{ṁ^(k,k')(u_0; β̂^(k,k')) | D_k^c} = 0. We use this fact for a careful analysis of δ^T ṁ^(k,k')(u_0; β̂^(k,k')) via a decomposition, and we establish the rate for the L_2 norm of the population score at u_0, whose bound can be derived from Assumptions 1b and 2a, the Cauchy-Schwarz inequality and Lemma A4-b. Hence, we obtain an immediate bound for the estimation error from (A.44) without considering the sparsity of u_0. We shall proceed to derive a sharper bound that involves the sparsity of u_0, analyzing two cases separately.
Case 1: the estimation error is dominated by β_0 − β̂^(k,k'). In this case, we simply have the corresponding bound from (A.44). Case 2: otherwise, the estimation error is comparable to the situation where the true β_0 is available for the Hessian, and the sparsity of u_0 may affect the estimation error.
Following the typical approach to establishing the cone condition for δ, we analyze the symmetrized Bregman divergence (A.47). Due to the convexity of the quadratic loss m̂^(k,k')(·; β̂^(k,k')), the symmetrized Bregman divergence (A.47) is nonnegative by a mean-value theorem (A.48). Denote the index set of the nonzero coefficients of u_0 as O_u = {j : u_{0,j} ≠ 0}, and denote by δ_{O_u} and δ_{O_u^c} the sub-vectors of δ at positions in O_u and outside O_u, respectively. The solution û^(k) satisfies the KKT condition of the penalized problem. From the KKT condition and the definitions of δ and O_u, applying (A.48) to (A.47), we obtain the upper bound.
Then, we apply (A.42), the definition of λ_u and (A.46), from which we can bound the L_1 norm of δ by the cone property (A.49). Now, applying the cone condition (A.49) and the case condition (A.46) to the bound (A.44), we obtain the rate for the estimation error.

Conclusion:
Since Case 1 and Case 2 are the complement of each other, one of them must occur.
Thus, the bound of estimation error is controlled by the larger bound in the two cases, which is our oracle inequality.
Analysis of Terms T_1-T_5 in Part 2. To establish the rate for T_1, we rewrite the term as a conditional expectation. Under Assumptions 1b and 2a, we derive the bound for the expectation using the Cauchy-Schwarz inequality and Lemma A4-b, as in (A.51). Since ||u_0||_2 is bounded according to (A.25), we have established the rate for T_1 as declared.
To establish the rate for T_2, we rewrite the term as a conditional expectation. Similar to (A.51), we derive the bound for the expectation under Assumptions 1b and 2a through the Cauchy-Schwarz inequality and Lemmas A4-b and A4-f. This bound immediately implies the declared rate.

To establish the rate for T_3, we rewrite the term as two empirical processes with diminishing summands, using the identity E{ℓ̇†(k)(β_0; γ_0) | D_k^c} = 0. Using Lemma A3, Lemma A4-h and Assumptions 1a and 2a, we show that each summand is sub-exponential. Applying Bernstein's inequality, we obtain the concentration bound, and we achieve the stated rate with the tightness of ||û^(k)||_2 from (A.25). To establish the rate for T_4, we rewrite the term as an empirical process with diminishing summands, using the identity E{ℓ̇^(k)_imp(γ_0) | D_k^c} = 0. Similar to the analysis of T_3, each summand is sub-exponential; applying Bernstein's inequality, we obtain the concentration bound and achieve the stated rate with the tightness of ||û^(k)||_2 from (A.25). To establish the rate for T_5, we rewrite the term as an empirical process with diminishing summands, whose summands have zero mean. Similar to the analysis of T_3, each summand is sub-exponential, and applying Bernstein's inequality yields the stated rate.

Analysis of Terms T_1-T_4 in Part 3
Conditionally on the out-of-fold data, the term T_1 is the empirical average of i.i.d. mean-zero random variables. Under Assumptions 1a, 1b and 2a, we apply Lemma A3 to bound the variance in (A.52), and thus, by Chebyshev's inequality, we obtain the declared rate for T_1. To analyze T_2, we consider the decomposition in which the estimators are replaced by the estimands one by one. Following the same calculation as in (A.52), we can bound the expectations. Applying the consistency of γ̂, β̂ and û from Lemma A8, Theorem 1 and Part 1 of the proof of Theorem 3, we establish the declared rate for T_2. Repeating the analyses of T_1 and T_2, we can show the corresponding rates for T_3 and T_4.

D1 Definitions
We adopt the following definitions of sub-Gaussian and sub-exponential random variables.
The random variable V is sub-Gaussian if ||V||_{ψ_2} is finite, and sub-exponential if ||V||_{ψ_1} is finite; the sub-Gaussian norms for random variables and for random vectors, as well as the more general Orlicz norm ||·||_{ψ_α} for α ∈ (0, 1], are defined below. Mimicking the (minimal) restricted eigenvalue condition on the minimal eigenvalue of a matrix over a cone (Bickel et al., 2009), we also define the restricted strong convexity parameters used in Lemmas A6 and A7.
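For concreteness, one standard normalization of these norms, consistent with the usage above (see, e.g., Vershynin, 2018), is

||V||_{ψ_α} = inf{t > 0 : E exp(|V|^α/t^α) ≤ 2} for α > 0,

so that α = 2 gives the sub-Gaussian norm and α = 1 the sub-exponential norm, and, for a random vector U,

||U||_{ψ_2} = sup_{||v||_2 = 1} ||v^T U||_{ψ_2}.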

D2 Statements of Existing Results
The properties in Lemmas A4 and A5 are covered in Chapters 2 and 4 of Vershynin (2018).
Lemma A4 (Properties of sub-Gaussian and sub-exponential random variables).
See Definition A1 for the definition of conditional expectation notation. The constants are all absolute.
The two inequalities in Lemma A6 are direct applications of Proposition 2 (page 22) of Negahban et al. (2010). We construct an auxiliary loss function to prove the following lemma.
Proof of Lemma A7. First, we show that √(g′(β̂^(k,k')T X_i)) X_i is a sub-Gaussian random vector whose second moment has all eigenvalues bounded away from infinity and zero. Under Assumptions 1b and 2a, we may apply Lemma A4-e to v^T √(g′(β̂^(k,k')T X_i)) X_i; thus √(g′(β̂^(k,k')T X_i)) X_i is a sub-Gaussian random vector. Under Assumptions 1b and 2a, we can bound the maximal eigenvalue of its second moment E_{i∈I_k∪J_k}{g′(β̂^(k,k')T X_i) X_i X_i^T}. We derive the lower bound for the minimal eigenvalue of this second moment from Assumptions 1b, 2a and 2b-i, the Cauchy-Schwarz inequality and Lemma A4-b. Second, we construct an auxiliary least squares loss in order to apply Negahban et al. (2010).
Let ε_i be independent standard normal random variables, and construct the loss function L^(k,k')(v) from them. By the design, the expected loss matches the quadratic form of interest. We apply Proposition 2 of Negahban et al. (2010) to L^(k,k')(v) conditionally on the out-of-fold data D_k^c and the event that β̂^(k,k') is consistent. Lemma A8. For a constant κ_cone(n, p, q, ε_r) ≍ √(s_γ log(p + q)/n), the event Ω_cone and the bound ||γ̂ − γ_0|| ≤ κ_cone(n, p, q, ε_r) occur with probability greater than 1 − ε_r under Assumptions 1a and 2a. Setting λ_γ accordingly, we have on the event Ω_cone the cone property over O_γ = {j : γ_{0,j} ≠ 0}, the index set of the nonzero coefficients of γ_0. Moreover, we have ||γ̂ − γ_0||_2 = O_p(√(s_γ log(p + q)/n)).
The concentration on the event Ω_cone is established by a union bound over element-wise concentration, which is in turn obtained from the Bernstein inequality for sub-exponential random variables (Lemma A4-h). The rest of Lemma A8 follows Lemma 1 of Huang and Zhang (2012) (page 1843 of the issue) and Corollary 5 (page 23) of Negahban et al. (2010).